We describe here our detailed data analysis.
The goal of this data analysis is to explore the relationship between asthma hospitalizations, as a measure of the human impact of air pollution, and the amount of green space in different California counties. We hope to learn whether higher levels of asthma hospitalization correlate with lower proportions of green space.
Although asthma hospitalization rates are not a perfect measure of air pollution, the two are strongly linked, and county-level data on asthma hospitalization rates in California is publicly available. Within asthma hospitalization rates, we’ll look specifically at age group and race to assess how hospitalization rates vary across groups. The racial makeup of a county is often correlated with socioeconomic status, so we hope to examine whether areas with a larger POC (people of color) population have both higher hospitalization rates and less green space, two factors that can correlate with lower socioeconomic status. However, it’s important to note that race is not an exact metric for socioeconomic status. Our observations may have implications for how the socioeconomic level of a county is linked to asthma hospitalizations or green space, but we will not conclusively define that relationship in this study.
To assess each county’s green space, we use the proportion of county area that is parkland and the number of parks per county. Both are included because they measure green space in two different ways, and considering area helps to account for the fact that parks can be drastically different sizes. Parkland data is an imperfect measure of tree or plant coverage because urban parks can contain few plants. Rural regions can also have large areas of plant coverage in private hands, and these areas can improve air quality despite not being open to the public. However, parks data is publicly available, and it generally gives a good idea of the number of parks that residents of a county should have access to, as well as the area that these public green spaces cover. It would be helpful to have additional data that classifies all land cover into percentages of grass cover, forest cover, building cover, and street cover, or similar categories, but that data was not available to us.
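As a rough sketch of how these two green-space measures could be derived from tract-level rows, the R snippet below builds both the park proportion and the park count per county. The data frame `tracts` and its values are made up for illustration. The column names follow those used in the analysis code on this page, and the de-duplication step assumes that each tract appears once per stratum (for example, once per age group), which may not match the real files.

```r
library(dplyr)

# Hypothetical tract-level data: each tract is repeated once per stratum,
# so tract areas must be de-duplicated before they are summed.
tracts <- data.frame(
  county_name = c("A", "A", "A", "A", "B", "B"),
  tractcode = c(1, 1, 2, 2, 3, 3),
  tract_area_sqmiles = c(10, 10, 30, 30, 50, 50),
  total_open_park_area_sqmiles = c(1, 1, 3, 3, 2, 2),
  open_parks_tract = c(2, 2, 1, 1, 4, 4)
)

county_green_space <- tracts %>%
  # Keep one row per tract within each county.
  distinct(county_name, tractcode, .keep_all = TRUE) %>%
  group_by(county_name) %>%
  summarize(
    county_area_sqmiles = sum(tract_area_sqmiles),
    county_park_area_sqmiles = sum(total_open_park_area_sqmiles),
    county_num_parks = sum(open_parks_tract),
    .groups = "drop"
  ) %>%
  # Measure 1: share of county area that is open parkland.
  # Measure 2 is county_num_parks, computed above.
  mutate(park_proportion = county_park_area_sqmiles / county_area_sqmiles)

print(county_green_space)
```

Without the `distinct()` step, the repeated tract rows would be counted once per stratum in both the area totals and the park count.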
Some of the major questions we are interested in answering include: Does a larger amount of open green space or a higher number of public parks correlate with lower asthma hospitalization rates across California counties? What are the differences in the racial and age makeup of hospitalizations across counties? What is the relationship between the racial makeup of asthma hospitalizations and the amount of green space in a county? What is the relationship between the age makeup of hospitalizations and green space? Does the relationship between the number of open parks and the asthma hospitalization rate differ from the relationship between the proportion of parkland and the asthma hospitalization rate? Which of the two is the better predictor?
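One possible way to approach the "better predictor" question is to fit one simple linear model per green-space measure and compare their fit. The sketch below does this with base R on a made-up county table. The data frame `counties`, its column names, and all of its values are hypothetical, and R-squared and AIC are only one reasonable choice of comparison, not necessarily the one this analysis settles on.

```r
# Hypothetical county-level data; the values are invented for illustration.
counties <- data.frame(
  hospitalization_rate = c(8.1, 6.4, 7.2, 5.0, 9.3, 4.8, 6.9, 5.6),
  park_proportion = c(0.05, 0.12, 0.08, 0.20, 0.03, 0.22, 0.10, 0.15),
  num_parks = c(40, 15, 60, 22, 35, 18, 50, 27)
)

# One single-predictor model per green-space measure.
mod_area <- lm(hospitalization_rate ~ park_proportion, data = counties)
mod_count <- lm(hospitalization_rate ~ num_parks, data = counties)

# Side-by-side fit statistics: higher R-squared and lower AIC indicate
# a better fit to this particular sample.
fit_comparison <- data.frame(
  predictor = c("park_proportion", "num_parks"),
  r_squared = c(summary(mod_area)$r.squared, summary(mod_count)$r.squared),
  aic = c(AIC(mod_area), AIC(mod_count))
)

print(fit_comparison)
```

Because both models have the same response and the same number of parameters, the two statistics will rank them the same way; with more predictors, adjusted R-squared or AIC would be the safer comparison.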
While exploring the data, we first examined how open green space and the number of parks differed across California counties. We focused on three variables: open park land, number of parks, and the proportion of park land to total county land. We found that although most counties use only about 10% of their land for parks, counties often contain a large number of outlier areas. These areas are census tracts, each containing between 1,200 and 8,000 people, so some specific areas within each county have far more open green space than the county average. We also looked at the overall relationship between open park land and age-adjusted hospitalization rate by county but did not see any clear correlation.
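To make the idea of outlier tracts concrete, the sketch below flags tracts whose open park area lies above the usual box-plot upper fence (third quartile plus 1.5 times the interquartile range) within their county. The data are invented, and the 1.5 × IQR rule is an assumption on our part about what counts as an outlier here; a different threshold would flag a different set of tracts.

```r
library(dplyr)

# Hypothetical tract-level park areas for two counties.
tracts <- data.frame(
  county_name = rep(c("A", "B"), each = 6),
  total_open_park_area_sqmiles = c(0.10, 0.20, 0.15, 0.12, 0.18, 3.00,
                                   0.50, 0.60, 0.55, 0.52, 0.58, 0.57)
)

flagged <- tracts %>%
  group_by(county_name) %>%
  mutate(
    # Box-plot upper fence, computed separately for each county.
    upper_fence = quantile(total_open_park_area_sqmiles, 0.75, names = FALSE) +
      1.5 * IQR(total_open_park_area_sqmiles),
    is_high_outlier = total_open_park_area_sqmiles > upper_fence
  ) %>%
  ungroup()

# Number of unusually green tracts per county.
outliers_per_county <- flagged %>%
  group_by(county_name) %>%
  summarize(n_high_outliers = sum(is_high_outlier), .groups = "drop")

print(outliers_per_county)
```

In this toy data only county A has a tract above its fence, which mirrors the pattern described above: a typical county with a few tracts holding far more park area than the rest.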
## [1] "asthmaCA_kids" "asthmaCA_kids_2"
## [3] "asthmaCA_kids_v_adults" "asthmaCA_race_ethnicity"
## [5] "clean_parks_data" "parks_asthmaCA_kids"
## [7] "parks_asthmaCA_kids_2" "parks_asthmaCA_kids_v_adults"
## [9] "parks_asthmaCA_race_ethnicity"
STATS
# Add county-level totals and per-tract averages to every row.
parks_asthmaCA_kids_v_adults_countystats <- parks_asthmaCA_kids_v_adults %>%
  group_by(county_name) %>%
  mutate(
    county_open_park_area = sum(total_open_park_area_sqmiles),
    county_open_parks = sum(open_parks_tract),
    county_number_hospitalizations = sum(number_hospitalizations),
    county_avg_open_park_area = mean(total_open_park_area_sqmiles),
    county_avg_open_parks = mean(open_parks_tract),
    county_avg_hospitalizations = mean(number_hospitalizations)
  )
DATA: parks_asthmaCA_kids_v_adults
# Tract-level open park area by county, colored by tract hospitalizations.
parks_asthmaCA_kids_v_adults %>%
  group_by(county_name) %>%
  ggplot(aes(x = reorder(county_name, total_open_park_area_sqmiles),
             y = total_open_park_area_sqmiles,
             color = number_hospitalizations)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "County Name", y = "County Tract Open Park Area (square miles)",
       color = "County Tract Number of Hospitalizations") +
  scale_color_viridis_c(option = "plasma")
# County totals: open park area, colored by total hospitalizations.
parks_asthmaCA_kids_v_adults_countystats %>%
  ggplot(aes(x = reorder(county_name, county_open_park_area),
             y = county_open_park_area,
             color = county_number_hospitalizations)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "County Name", y = "County Open Park Area (square miles)",
       color = "County Number of Hospitalizations") +
  scale_color_viridis_c(option = "plasma")
# County averages (by tract): open park area, colored by hospitalizations.
parks_asthmaCA_kids_v_adults_countystats %>%
  ggplot(aes(x = reorder(county_name, county_avg_open_park_area),
             y = county_avg_open_park_area,
             color = county_avg_hospitalizations)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "County Name",
       y = "County Average (by tract) Open Park Area (square miles)",
       color = "County Average (by tract) Hospitalizations") +
  scale_color_viridis_c(option = "plasma")
# Age-adjusted hospitalization rate by county, faceted by stratum.
parks_asthmaCA_kids_v_adults_countystats %>%
  filter(!is.na(age_adjusted_hospitalization_rate)) %>%
  group_by(county_name) %>%
  ggplot(aes(x = reorder(county_name, age_adjusted_hospitalization_rate),
             y = age_adjusted_hospitalization_rate,
             color = county_open_park_area)) +
  geom_point() +
  facet_wrap(~ strata_name) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "County", y = "Age-Adjusted Hospitalization Rate",
       color = "County Open Park Area") +
  scale_color_viridis_c()
parks_asthmaCA_kids_v_adults_countystats %>% filter(!is.na(age_adjusted_hospitalization_rate)) %>% group_by(county_name) %>%
ggplot(aes(x = county_name %>% reorder(number_hospitalizations),
y = number_hospitalizations, color = county_open_park_area)) +
geom_point() + facet_wrap(~ strata_name) +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "County", y = "Number of Hospitalizations", color = "County Open Park Area") + scale_color_viridis_c()
parks_asthmaCA_kids_v_adults_countystats <- parks_asthmaCA_kids_v_adults_countystats %>%
filter(!is.na(county_number_hospitalizations), !is.na(county_open_park_area))
mod1 <- lm(county_number_hospitalizations ~ county_open_park_area, data = parks_asthmaCA_kids_v_adults_countystats)
beta <- coef(mod1)
# Scatter plot with the fitted regression line from mod1 overlaid in red.
parks_asthmaCA_kids_v_adults_countystats %>%
  ggplot(aes(x = county_open_park_area, y = county_number_hospitalizations)) +
  geom_point() +
  geom_abline(intercept = beta[1], slope = beta[2], color = "red")
DATA: parks_asthmaCA_race_ethnicity
parks_data_by_tract <- parks_asthmaCA_race_ethnicity %>%
  group_by(tractcode) %>%
  # Collapse to one row per tract. A census tract lies within a single county,
  # so take the first county_name rather than returning one row per input row
  # (which would double-count tract areas in the county sums that follow).
  summarize(
    open_parks_tract = mean(open_parks_tract),
    tract_area_sqmiles = mean(tract_area_sqmiles),
    total_open_park_area_sqmiles = mean(total_open_park_area_sqmiles),
    county_name = first(county_name),
    .groups = "drop"
  )
parks_data_by_county <- parks_data_by_tract %>%
  group_by(county_name) %>%
  summarize(
    total_county_area_sqm = sum(tract_area_sqmiles),
    total_county_park_area_sqm = sum(total_open_park_area_sqmiles),
    county_num_parks = sum(open_parks_tract)
  )
hospitalizations_white <- parks_asthmaCA_race_ethnicity %>%
filter(race_ethnicity == "White") %>%
group_by(county_name) %>%
summarize(number_hospitalizations_white = mean(number_hospitalizations))
hospitalizations_black <- parks_asthmaCA_race_ethnicity %>%
filter(race_ethnicity == "Black") %>%
group_by(county_name) %>%
summarize(number_hospitalizations_black = mean(number_hospitalizations))
hospitalizations_hispanic <- parks_asthmaCA_race_ethnicity %>%
filter(race_ethnicity == "Hispanic") %>%
group_by(county_name) %>%
summarize(number_hospitalizations_hispanic = mean(number_hospitalizations))
hospitalizations_asian_pi <- parks_asthmaCA_race_ethnicity %>%
filter(race_ethnicity == "Asian/PI") %>%
group_by(county_name) %>%
summarize(number_hospitalizations_asian_pi = mean(number_hospitalizations))
hospitalizations_ai_an <- parks_asthmaCA_race_ethnicity %>%
filter(race_ethnicity == "AI/AN") %>%
group_by(county_name) %>%
summarize(number_hospitalizations_ai_an = mean(number_hospitalizations))
hospitalizations_poc <- asthmaCA_race_ethnicity %>%
  filter(STRATA_NAME != "White") %>%
  group_by(COUNTY) %>%
  summarize(number_hospitalizations_poc = sum(NUMBER_OF_HOSPITALIZATIONS))
hospitalizations_total <- asthmaCA_race_ethnicity %>%
  group_by(COUNTY) %>%
  summarize(NUMBER_OF_HOSPITALIZATIONS = sum(NUMBER_OF_HOSPITALIZATIONS))
# Join the county-level park data with each hospitalization summary.
CA_county_asthma_parks <- parks_data_by_county %>%
  full_join(hospitalizations_white, by = "county_name") %>%
  full_join(hospitalizations_poc, by = c("county_name" = "COUNTY")) %>%
  full_join(hospitalizations_hispanic, by = "county_name") %>%
  full_join(hospitalizations_asian_pi, by = "county_name") %>%
  # As the AI/AN category values were all 0 or NA, we decided it was not
  # useful to keep analyzing that information, so it is not joined here.
  full_join(hospitalizations_black, by = "county_name") %>%
  full_join(hospitalizations_total, by = c("county_name" = "COUNTY"))
DATA: parks_asthmaCA_kids
# County-level sums and averages, attached to every row.
sums <- parks_asthmaCA_kids %>%
  group_by(county_name) %>%
  mutate(total_tract_area = sum(tract_area_sqmiles),
         total_open_park_area = sum(total_open_park_area_sqmiles),
         num_open_parks = sum(open_parks_tract))
avgs <- parks_asthmaCA_kids %>%
  group_by(county_name) %>%
  mutate(avg_tract_area = mean(tract_area_sqmiles),
         avg_open_park_area = mean(total_open_park_area_sqmiles),
         avg_num_open_parks = mean(open_parks_tract))
# Number of open parks vs. hospitalization counts and rates.
ggplotly(sums %>%
  ggplot(aes(num_open_parks, number_hospitalizations, col = county_name)) +
  geom_point() +
  labs(x = "Count of Open Parks", y = "Hospitalizations", col = "County Name"))
ggplotly(sums %>%
  ggplot(aes(num_open_parks, age_adjusted_hospitalization_rate, col = county_name)) +
  geom_point() +
  labs(x = "Count of Open Parks", y = "Age-Adjusted Hospitalization Rate",
       col = "County Name"))
# Open park area vs. hospitalization counts.
ggplotly(sums %>%
  ggplot(aes(x = total_open_park_area, y = number_hospitalizations,
             col = age_adjusted_hospitalization_rate)) +
  geom_jitter() +
  labs(x = "Open Park Area (square miles)", y = "Hospitalizations",
       col = "Age-Adjusted Hospitalization Rate"))
ggplotly(sums %>%
  ggplot(aes(x = total_open_park_area, y = number_hospitalizations,
             col = county_name)) +
  geom_jitter() +
  labs(x = "Open Park Area (square miles)", y = "Hospitalizations",
       col = "County Name"))